acs <- read.table("http://jaredlander.com/data/acs_ny.csv",sep=",",
header=TRUE, stringsAsFactors=TRUE)
library(ggplot2)
library(useful)
library(caret)
## Loading required package: lattice
library(ISLR)
library(scales)
library(plyr)
library(rpart)
library(rpart.plot)
library(class)
library(MuMIn)
library(caret)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(RColorBrewer)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:MuMIn':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:scales':
##
## alpha
## The following object is masked from 'package:ggplot2':
##
## alpha
library(nlme)
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'lme4'
## The following object is masked from 'package:nlme':
##
## lmList
library(mgcv)
## This is mgcv 1.8-11. For overview type 'help("mgcv-package")'.
Let’s assume our goal is to build a model to predict if household income is greater than $250,000 per year.
We start by building a binary response variable.
acs$HighIncome <- as.numeric(with(acs, FamilyIncome >= 250000))
head(acs)
## Acres FamilyIncome FamilyType NumBedrooms NumChildren NumPeople
## 1 1-10 150 Married 4 1 3
## 2 1-10 180 Female Head 3 2 4
## 3 1-10 280 Female Head 4 0 2
## 4 1-10 330 Female Head 2 1 2
## 5 1-10 330 Male Head 3 1 2
## 6 1-10 480 Male Head 0 3 4
## NumRooms NumUnits NumVehicles NumWorkers OwnRent YearBuilt
## 1 9 Single detached 1 0 Mortgage 1950-1959
## 2 6 Single detached 2 0 Rented Before 1939
## 3 8 Single detached 3 1 Mortgage 2000-2004
## 4 4 Single detached 1 0 Rented 1950-1959
## 5 5 Single attached 1 0 Mortgage Before 1939
## 6 1 Single detached 0 0 Rented Before 1939
## HouseCosts ElectricBill FoodStamp HeatingFuel Insurance Language
## 1 1800 90 No Gas 2500 English
## 2 850 90 No Oil 0 English
## 3 2600 260 No Oil 6600 Other European
## 4 1800 140 No Oil 0 English
## 5 860 150 No Gas 660 Spanish
## 6 700 140 No Gas 0 English
## HighIncome
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
tail(acs)
## Acres FamilyIncome FamilyType NumBedrooms NumChildren NumPeople
## 22740 10+ 556250 Married 4 0 2
## 22741 10+ 565000 Married 5 3 5
## 22742 10+ 599000 Married 4 0 2
## 22743 10+ 611700 Married 4 1 5
## 22744 10+ 621430 Married 3 2 4
## 22745 10+ 751200 Married 4 2 4
## NumRooms NumUnits NumVehicles NumWorkers OwnRent YearBuilt
## 22740 10 Single detached 2 1 Mortgage Before 1939
## 22741 10 Single detached 2 2 Mortgage 1990-1999
## 22742 6 Single detached 2 2 Mortgage Before 1939
## 22743 9 Single detached 5 3 Mortgage Before 1939
## 22744 11 Single detached 2 3 Mortgage 1970-1979
## 22745 10 Single detached 2 2 Mortgage 1940-1949
## HouseCosts ElectricBill FoodStamp HeatingFuel Insurance Language
## 22740 1800 380 No Oil 1700 English
## 22741 1700 370 No Gas 1000 English
## 22742 1300 100 No Gas 3500 English
## 22743 410 100 No Oil 1300 Spanish
## 22744 1600 80 No Gas 800 Spanish
## 22745 6500 400 No Oil 2000 English
## HighIncome
## 22740 1
## 22741 1
## 22742 1
## 22743 1
## 22744 1
## 22745 1
acs$foodstamp_binary <- ifelse(acs$FoodStamp == "Yes",1,0) # (yes = 1, no = 0)
# Option 2 for doing this - I did not use this. Does not add a new column but modifies the existing column
# levels(acs$FoodStamp) <- c("0","1")
acs$own_home <- ifelse(acs$OwnRent == "Rented",0, ifelse(acs$OwnRent == "Mortgage",1,2)) # (rented = 0, mortgage = 1, owned outright = 2)
acs$family_type_cat <- ifelse(acs$FamilyType == "Married",1, ifelse(acs$FamilyType == "Male Head",2,3))
# married = 1, male head = 2, female head = 3
acs$InsuranceHigh <- (acs$Insurance > 1000) * 1
acs$NumWorkers2 <- (acs$NumWorkers == 2) * 1
acs$HouseCostsHigh <- (acs$HouseCosts > 1000) * 1
acs$high_electric <- (acs$ElectricBill > 350) * 1
set.seed(447)
testrecs <- sample(nrow(acs),0.2 * nrow(acs))
acs_test <- acs[testrecs,]
acs_fit <- acs[-testrecs,]
Create a binary variable where 1 = not on food stamps, not renting, and married.
acs$HI_pred1 <- 0
acs$HI_pred1[acs$FoodStamp == 'No' & acs$OwnRent != 'Rented' & acs$FamilyType == 'Married'] <- 1
# I like this plot for a quick look at what I'm dealing with before digging in (modified from class notes)
ggplot(acs, aes(x=FamilyIncome)) + geom_density(fill="#31a354", color="#31a354") +
geom_vline(xintercept=250000) + scale_x_continuous(labels=multiple.dollar, limits=c(0,1000000))
## Warning: Removed 13 rows containing non-finite values (stat_density).
If the p-value is less than .05 (or the chosen significance level), the sample is NOT normally distributed. That is not directly relevant here, but worth keeping for future reference. The plot shows a clear right-skewed distribution; expected, but still nice to include.
shapiro.test(acs_test$FamilyIncome)
##
## Shapiro-Wilk normality test
##
## data: acs_test$FamilyIncome
## W = 0.71538, p-value < 2.2e-16
Before trying to build any classification models, you should do some exploratory data analysis to get a sense of which variables might be useful for predicting cases where FamilyIncome >= 250000. You should use a combination of group-by analysis (i.e. plyr or dplyr or similar) and plotting.
If you decide you’d like to create some new variables (feature engineering), feel free to do that. Just document what you did and why you did it.
# Get some summary stats on each variable
summary(acs_fit)
## Acres FamilyIncome FamilyType NumBedrooms
## 10+ : 815 Min. : 50 Female Head: 2594 Min. :0.000
## 1-10 : 3709 1st Qu.: 53000 Male Head : 907 1st Qu.:3.000
## Sub 1:13672 Median : 87100 Married :14695 Median :3.000
## Mean : 110726 Mean :3.384
## 3rd Qu.: 134082 3rd Qu.:4.000
## Max. :1605000 Max. :8.000
##
## NumChildren NumPeople NumRooms NumUnits
## Min. : 0.0000 Min. : 2.0 Min. : 1.00 Mobile home : 583
## 1st Qu.: 0.0000 1st Qu.: 2.0 1st Qu.: 6.00 Single attached: 1916
## Median : 0.0000 Median : 3.0 Median : 7.00 Single detached:15697
## Mean : 0.9013 Mean : 3.4 Mean : 7.18
## 3rd Qu.: 2.0000 3rd Qu.: 4.0 3rd Qu.: 8.00
## Max. :12.0000 Max. :18.0 Max. :21.00
##
## NumVehicles NumWorkers OwnRent YearBuilt
## Min. :0.000 Min. :0.000 Mortgage:16074 Before 1939:4874
## 1st Qu.:2.000 1st Qu.:1.000 Outright: 128 1950-1959 :3107
## Median :2.000 Median :2.000 Rented : 1994 1960-1969 :2201
## Mean :2.119 Mean :1.748 1970-1979 :1894
## 3rd Qu.:3.000 3rd Qu.:2.000 1990-1999 :1647
## Max. :6.000 Max. :3.000 1980-1989 :1598
## (Other) :2875
## HouseCosts ElectricBill FoodStamp HeatingFuel
## Min. : 4 Min. : 1.0 No :16777 Gas :10899
## 1st Qu.: 650 1st Qu.:100.0 Yes: 1419 Oil : 5187
## Median :1200 Median :150.0 Wood : 986
## Mean :1480 Mean :174.4 Electricity: 831
## 3rd Qu.:2000 3rd Qu.:220.0 Other : 148
## Max. :7090 Max. :580.0 Coal : 124
## (Other) : 21
## Insurance Language HighIncome
## Min. : 0.0 Asian Pacific : 639 Min. :0.00000
## 1st Qu.: 400.0 English :14262 1st Qu.:0.00000
## Median : 720.0 Other : 225 Median :0.00000
## Mean : 958.9 Other European: 1640 Mean :0.06139
## 3rd Qu.:1200.0 Spanish : 1430 3rd Qu.:0.00000
## Max. :6600.0 Max. :1.00000
##
## foodstamp_binary own_home family_type_cat InsuranceHigh
## Min. :0.00000 Min. :0.000 Min. :1.000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:0.000
## Median :0.00000 Median :2.000 Median :1.000 Median :0.000
## Mean :0.07798 Mean :1.781 Mean :1.385 Mean :0.326
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :1.00000 Max. :2.000 Max. :3.000 Max. :1.000
##
## NumWorkers2 HouseCostsHigh high_electric
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :1.0000 Median :0.00000
## Mean :0.4768 Mean :0.5433 Mean :0.06666
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
##
# see that those that own their home correlate with higher incomes overall
ggplot(acs_fit) + geom_histogram(aes(x=own_home), fill = "gray")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# scatter number of workers and family income - shows that homes with 2 workers correlate with higher incomes vs other numbers of workers
ggplot(data=acs_fit) + geom_point(aes(x=NumWorkers, y=FamilyIncome))
# scatter plot shows that those not on foodstamps tend to have higher income = duh, but relevant for model later
ggplot(data=acs_fit) + geom_point(aes(x=foodstamp_binary, y=FamilyIncome))
# notice that there are very few observations with the male head type; female-headed households have the lowest incomes and married households the highest
ggplot(data=acs_fit) + geom_point(aes(x=family_type_cat, y=FamilyIncome))
# scatter house costs and family income - see that higher house costs correlate with higher incomes (slightly) - nothing major though
ggplot(data=acs_fit) + geom_point(aes(x=HouseCosts, y=FamilyIncome))
# create matrix of scatterplots
# pairs(acs[,1:19])
coord_cartesian: Setting limits on the coordinate system zooms the plot (like you’re looking at it with a magnifying glass) and does not change the underlying data, unlike setting limits on a scale.
# See that outliers begin roughly around income of $100,000
ggplot(data=acs_fit) + geom_boxplot(aes(x=NumWorkers, y=FamilyIncome, group=NumWorkers)) + coord_cartesian(ylim = c(0, 350000))
These density plots show the distribution of each variable; they are useful for seeing where values are concentrated.
ggplot(acs_fit) + geom_density(aes(x=FamilyIncome)) + scale_x_continuous(labels=dollar)
ggplot(acs_fit) + geom_density(aes(x=HouseCosts)) + scale_x_continuous(labels=dollar)
ggplot(acs_fit) + geom_density(aes(x=NumChildren)) + scale_x_continuous()
ggplot(acs_fit) + geom_density(aes(x=FamilyIncome)) + scale_x_log10(breaks=c(100,1000,10000,100000), labels=dollar) + annotation_logticks(sides="bt")
ggplot(acs_fit) + geom_density(aes(x=HouseCosts)) + scale_x_log10(breaks=c(100,1000,10000,100000), labels=dollar) + annotation_logticks(sides="bt")
# shows positive correlation between insurance and family income
ggplot(acs_fit, aes(x=Insurance, y=FamilyIncome)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam'
# density plot for electric bill
ggplot(acs_fit) + geom_density(aes(x=ElectricBill)) + scale_x_log10(breaks=c(100,1000,10000,100000), labels=dollar) + annotation_logticks(sides="bt")
# shows positive correlation between electric bill and family income
ggplot(acs_fit, aes(x=ElectricBill, y=FamilyIncome)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam'
# This shows a good spread or range of each family type group. This will lend itself well to being included in my analysis
ddply(acs_fit,.(FamilyType),summarise,family_type_count=length(FamilyIncome))
## FamilyType family_type_count
## 1 Female Head 2594
## 2 Male Head 907
## 3 Married 14695
# Interesting look at mean income of family type grouped with home ownership type
ddply(acs_fit,.(FamilyType,OwnRent), summarise, mean_income=mean(FamilyIncome))
## FamilyType OwnRent mean_income
## 1 Female Head Mortgage 69983.15
## 2 Female Head Outright 92303.57
## 3 Female Head Rented 34258.14
## 4 Male Head Mortgage 79298.16
## 5 Male Head Outright 213250.00
## 6 Male Head Rented 47620.92
## 7 Married Mortgage 126152.48
## 8 Married Outright 110020.36
## 9 Married Rented 71331.00
ddply(acs_fit,.(FamilyType,FoodStamp), summarise, mean_income=mean(FamilyIncome))
## FamilyType FoodStamp mean_income
## 1 Female Head No 67144.55
## 2 Female Head Yes 37105.47
## 3 Male Head No 78679.59
## 4 Male Head Yes 45825.75
## 5 Married No 124832.07
## 6 Married Yes 61612.91
# simple look at mean income by foodstamp. Obvious results, but at the same time surprising to find that the mean income
# for those on food stamps is near $50k
ddply(acs_fit,.(FoodStamp), summarise, mean_income=mean(FamilyIncome))
## FoodStamp mean_income
## 1 No 115898.77
## 2 Yes 49572.72
ddply(acs_fit,.(FoodStamp,NumBedrooms), summarise, mean_income=mean(FamilyIncome), num_bedrooms=mean(NumBedrooms))
## FoodStamp NumBedrooms mean_income num_bedrooms
## 1 No 0 91092.92 0
## 2 No 1 68606.42 1
## 3 No 2 80216.78 2
## 4 No 3 102048.19 3
## 5 No 4 137687.35 4
## 6 No 5 173357.12 5
## 7 No 8 182832.27 8
## 8 Yes 0 36577.50 0
## 9 Yes 1 22038.06 1
## 10 Yes 2 33118.22 2
## 11 Yes 3 45066.36 3
## 12 Yes 4 62828.82 4
## 13 Yes 5 60499.15 5
## 14 Yes 8 82773.62 8
# This is a little excessive, but would be useful for piping it to a csv file (for example) if relevant to tasks at hand
##ddply(acs,.(NumBedrooms,NumChildren,NumPeople,NumRooms,NumUnits,NumVehicles,NumWorkers), summarise, mean_income=mean(FamilyIncome))
ddply(acs_fit,.(OwnRent), summarise, mean_income=mean(FamilyIncome))
## OwnRent mean_income
## 1 Mortgage 117551.80
## 2 Outright 109695.55
## 3 Rented 55771.63
# Family Type
tapply(acs_fit$FamilyIncome,acs_fit$FamilyType,length)
## Female Head Male Head Married
## 2594 907 14695
tapply(acs_fit$FamilyIncome,acs_fit$FamilyType,mean)
## Female Head Male Head Married
## 60242.74 72992.65 121966.88
# Own/Rent
tapply(acs_fit$FamilyIncome,acs_fit$OwnRent,length)
## Mortgage Outright Rented
## 16074 128 1994
tapply(acs_fit$FamilyIncome,acs_fit$OwnRent,mean)
## Mortgage Outright Rented
## 117551.80 109695.55 55771.63
# Food Stamps
tapply(acs_fit$FamilyIncome,acs_fit$FoodStamp,length)
## No Yes
## 16777 1419
tapply(acs_fit$FamilyIncome,acs_fit$FoodStamp,mean)
## No Yes
## 115898.77 49572.72
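The same group-by summaries can also be written with dplyr, which was loaded above. A quick sketch (note that dplyr's summarise masks plyr's when both are attached, so the explicit dplyr:: prefix avoids ambiguity):

```r
# dplyr equivalent of the ddply/tapply summaries above
acs_fit %>%
  group_by(FamilyType, OwnRent) %>%
  dplyr::summarise(mean_income = mean(FamilyIncome),
                   n = dplyr::n())
```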
Let’s start by building a null model in which you simply predict that everyone’s income is < 250000 (since the majority of incomes are less than 250000).
acs$null_model <- 0
Create a confusion matrix table and compute the overall accuracy of this model as well as its sensitivity and specificity.
library(caret)
table(acs$HighIncome, acs$null_model)
##
## 0
## 0 21356
## 1 1389
prop.table(table(acs$HighIncome, acs$null_model))
##
## 0
## 0 0.93893163
## 1 0.06106837
confusionMatrix(as.factor(acs$null_model), as.factor(acs$HighIncome), positive = "1")
## Warning in confusionMatrix.default(as.factor(acs$null_model), as.factor(acs
## $HighIncome), : Levels are not in the same order for reference and data.
## Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 21356 1389
## 1 0 0
##
## Accuracy : 0.9389
## 95% CI : (0.9357, 0.942)
## No Information Rate : 0.9389
## P-Value [Acc > NIR] : 0.5071
##
## Kappa : 0
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.00000
## Specificity : 1.00000
## Pos Pred Value : NaN
## Neg Pred Value : 0.93893
## Prevalence : 0.06107
## Detection Rate : 0.00000
## Detection Prevalence : 0.00000
## Balanced Accuracy : 0.50000
##
## 'Positive' Class : 1
##
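The null-model metrics reported by caret can be verified by hand from the confusion table above:

```r
# verify the null model's metrics from the table above
TN <- 21356; FN <- 1389; TP <- 0; FP <- 0
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 0.9389, equal to the No Information Rate
sensitivity <- TP / (TP + FN)                   # 0: no positives are ever predicted
specificity <- TN / (TN + FP)                   # 1: every negative is predicted correctly
```

This is why accuracy alone is a poor yardstick here: with only ~6% of households above $250,000, a model that predicts "not high income" for everyone already scores 94%.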
We would like to build a more accurate model than this. Your job is to build classifiers to predict the binary HighIncome we created. You will be using three different classification techniques:

* decision trees (use rpart package - see Kaggle Titanic example from StatModels2 or session on trees)
* logistic regression (see logistic regression examples we did in StatModels2)
* k-nearest neighbor or some other technique (see kNN example we did in StatModels2)

For each technique, you should:

* build a few models with the training data
* create confusion matrices (using caret package) to pick the best fit model for each technique
* use your three best fit models (one from each technique) to predict using the test dataset and evaluate which of the models performs the best
* write a few paragraphs discussing what you did and what you found. In particular, how difficult is it to predict HighIncome? Did one of the techniques outperform the other two?
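For the decision tree technique, a minimal rpart sketch; the predictors chosen here are illustrative starting points, not a tuned answer:

```r
# a first-cut classification tree on a few of the variables that looked
# promising in the exploratory analysis above
tree1 <- rpart(HighIncome ~ FamilyType + OwnRent + FoodStamp + Insurance + HouseCosts,
               data = acs_fit, method = "class")
rpart.plot(tree1)  # visualize the splits
tree1_pred <- predict(tree1, newdata = acs_test, type = "class")
confusionMatrix(as.factor(tree1_pred), as.factor(acs_test$HighIncome), positive = "1")
```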
logmod1 <- glm(HighIncome ~ FamilyType + NumVehicles + OwnRent + Insurance + YearBuilt, data=acs_fit,
family=binomial(link="logit"))
summary(logmod1)
##
## Call:
## glm(formula = HighIncome ~ FamilyType + NumVehicles + OwnRent +
## Insurance + YearBuilt, family = binomial(link = "logit"),
## data = acs_fit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9844 -0.3540 -0.2875 -0.1979 3.3452
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.606e+01 1.539e+02 -0.104 0.917
## FamilyTypeMale Head 6.722e-01 3.116e-01 2.157 0.031 *
## FamilyTypeMarried 1.674e+00 2.035e-01 8.225 < 2e-16 ***
## NumVehicles 2.463e-01 3.374e-02 7.300 2.88e-13 ***
## OwnRentOutright 7.038e-01 3.182e-01 2.212 0.027 *
## OwnRentRented -1.159e-01 1.937e-01 -0.598 0.550
## Insurance 6.605e-04 2.212e-05 29.855 < 2e-16 ***
## YearBuilt1940-1949 9.780e+00 1.539e+02 0.064 0.949
## YearBuilt1950-1959 1.034e+01 1.539e+02 0.067 0.946
## YearBuilt1960-1969 1.036e+01 1.539e+02 0.067 0.946
## YearBuilt1970-1979 1.015e+01 1.539e+02 0.066 0.947
## YearBuilt1980-1989 1.049e+01 1.539e+02 0.068 0.946
## YearBuilt1990-1999 1.050e+01 1.539e+02 0.068 0.946
## YearBuilt2000-2004 1.061e+01 1.539e+02 0.069 0.945
## YearBuilt2005 1.071e+01 1.539e+02 0.070 0.945
## YearBuilt2006 1.105e+01 1.539e+02 0.072 0.943
## YearBuilt2007 1.074e+01 1.539e+02 0.070 0.944
## YearBuilt2008 1.080e+01 1.539e+02 0.070 0.944
## YearBuilt2009 1.046e+01 1.539e+02 0.068 0.946
## YearBuilt2010 1.135e+01 1.539e+02 0.074 0.941
## YearBuiltBefore 1939 1.033e+01 1.539e+02 0.067 0.946
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8398.1 on 18195 degrees of freedom
## Residual deviance: 7076.0 on 18175 degrees of freedom
## AIC: 7118
##
## Number of Fisher Scoring iterations: 12
acs_test$yhat_logmod1 <- predict(logmod1, newdata=acs_test, type='response')
acs_test$yhat_logmod1 <- (acs_test$yhat_logmod1 > 0.05) * 1 # classify as 1 when predicted probability exceeds 0.05
log_mod1_aic <- summary(logmod1)$aic
log_cm1 <- confusionMatrix(as.factor(acs_test$yhat_logmod1), as.factor(acs_test$HighIncome), positive = "1")
log_cm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2662 57
## 1 1615 215
##
## Accuracy : 0.6324
## 95% CI : (0.6182, 0.6465)
## No Information Rate : 0.9402
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1121
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.79044
## Specificity : 0.62240
## Pos Pred Value : 0.11749
## Neg Pred Value : 0.97904
## Prevalence : 0.05979
## Detection Rate : 0.04726
## Detection Prevalence : 0.40229
## Balanced Accuracy : 0.70642
##
## 'Positive' Class : 1
##
logmod2 <- glm(HighIncome ~ FamilyType + FoodStamp + OwnRent, data=acs_fit, family=binomial(link="logit"))
summary(logmod2)
##
## Call:
## glm(formula = HighIncome ~ FamilyType + FoodStamp + OwnRent,
## family = binomial(link = "logit"), data = acs_fit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4992 -0.4048 -0.4048 -0.1728 3.7799
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.1971 0.1947 -21.557 < 2e-16 ***
## FamilyTypeMale Head 0.6399 0.3026 2.114 0.0345 *
## FamilyTypeMarried 1.7363 0.1968 8.821 < 2e-16 ***
## FoodStampYes -1.9012 0.3578 -5.314 1.08e-07 ***
## OwnRentOutright 0.4413 0.2964 1.489 0.1365
## OwnRentRented -1.0447 0.1885 -5.541 3.00e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8398.1 on 18195 degrees of freedom
## Residual deviance: 8037.1 on 18190 degrees of freedom
## AIC: 8049.1
##
## Number of Fisher Scoring iterations: 7
acs_test$yhat_logmod2 <- predict(logmod2, newdata=acs_test, type='response')
acs_test$yhat_logmod2 <- (acs_test$yhat_logmod2 > 0.05) * 1
log_mod2_aic <- summary(logmod2)$aic
log_cm2 <- confusionMatrix(as.factor(acs_test$yhat_logmod2), as.factor(acs_test$HighIncome), positive = "1")
log_cm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1265 28
## 1 3012 244
##
## Accuracy : 0.3317
## 95% CI : (0.318, 0.3456)
## No Information Rate : 0.9402
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0314
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.89706
## Specificity : 0.29577
## Pos Pred Value : 0.07494
## Neg Pred Value : 0.97834
## Prevalence : 0.05979
## Detection Rate : 0.05364
## Detection Prevalence : 0.71576
## Balanced Accuracy : 0.59641
##
## 'Positive' Class : 1
##
logmod3 <- glm(HighIncome ~ InsuranceHigh + NumWorkers2 + HouseCostsHigh + FoodStamp + OwnRent,
data=acs_fit, family=binomial(link="logit"))
summary(logmod3)
##
## Call:
## glm(formula = HighIncome ~ InsuranceHigh + NumWorkers2 + HouseCostsHigh +
## FoodStamp + OwnRent, family = binomial(link = "logit"), data = acs_fit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2253 -0.3183 -0.2682 -0.1640 3.6467
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.30241 0.09797 -43.915 < 2e-16 ***
## InsuranceHigh 1.34537 0.07504 17.929 < 2e-16 ***
## NumWorkers2 0.07590 0.06409 1.184 0.2363
## HouseCostsHigh 1.26356 0.09694 13.035 < 2e-16 ***
## FoodStampYes -2.07741 0.35866 -5.792 6.95e-09 ***
## OwnRentOutright 1.72956 0.31359 5.515 3.48e-08 ***
## OwnRentRented -0.34415 0.19554 -1.760 0.0784 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8398.1 on 18195 degrees of freedom
## Residual deviance: 7305.5 on 18189 degrees of freedom
## AIC: 7319.5
##
## Number of Fisher Scoring iterations: 7
acs_test$yhat_logmod3 <- predict(logmod3, newdata=acs_test, type='response')
acs_test$yhat_logmod3 <- (acs_test$yhat_logmod3 > 0.05) * 1
log_mod3_aic <- summary(logmod3)$aic
log_cm3 <- confusionMatrix(as.factor(acs_test$yhat_logmod3), as.factor(acs_test$HighIncome), positive = "1")
log_cm3
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3090 72
## 1 1187 200
##
## Accuracy : 0.7232
## 95% CI : (0.71, 0.7362)
## No Information Rate : 0.9402
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1568
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.73529
## Specificity : 0.72247
## Pos Pred Value : 0.14420
## Neg Pred Value : 0.97723
## Prevalence : 0.05979
## Detection Rate : 0.04397
## Detection Prevalence : 0.30490
## Balanced Accuracy : 0.72888
##
## 'Positive' Class : 1
##
logmod4 <- glm(HighIncome ~ InsuranceHigh + NumWorkers2 + HouseCostsHigh, data=acs_fit, family=binomial(link="logit"))
summary(logmod4)
##
## Call:
## glm(formula = HighIncome ~ InsuranceHigh + NumWorkers2 + HouseCostsHigh,
## family = binomial(link = "logit"), data = acs_fit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5958 -0.3163 -0.2879 -0.1577 2.9642
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.38071 0.09202 -47.604 <2e-16 ***
## InsuranceHigh 1.41041 0.07252 19.447 <2e-16 ***
## NumWorkers2 0.11334 0.06374 1.778 0.0753 .
## HouseCostsHigh 1.21827 0.09407 12.950 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8398.1 on 18195 degrees of freedom
## Residual deviance: 7404.8 on 18192 degrees of freedom
## AIC: 7412.8
##
## Number of Fisher Scoring iterations: 6
acs_test$yhat_logmod4 <- predict(logmod4, newdata=acs_test, type='response')
acs_test$yhat_logmod4 <- (acs_test$yhat_logmod4 > 0.05) * 1
log_mod4_aic <- summary(logmod4)$aic
log_cm4 <- confusionMatrix(as.factor(acs_test$yhat_logmod4), as.factor(acs_test$HighIncome), positive = "1")
log_cm4
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3072 73
## 1 1205 199
##
## Accuracy : 0.7191
## 95% CI : (0.7058, 0.7321)
## No Information Rate : 0.9402
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1526
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.73162
## Specificity : 0.71826
## Pos Pred Value : 0.14174
## Neg Pred Value : 0.97679
## Prevalence : 0.05979
## Detection Rate : 0.04375
## Detection Prevalence : 0.30864
## Balanced Accuracy : 0.72494
##
## 'Positive' Class : 1
##
logmod5 <- glm(HighIncome ~ FamilyType + NumBedrooms + NumChildren + NumPeople + NumRooms + NumUnits + NumVehicles +
NumWorkers + OwnRent + HouseCosts + ElectricBill + FoodStamp + Insurance + Language +
InsuranceHigh + NumWorkers2 + HouseCostsHigh, data=acs_fit, family=binomial(link="logit"))
summary(logmod5)
##
## Call:
## glm(formula = HighIncome ~ FamilyType + NumBedrooms + NumChildren +
## NumPeople + NumRooms + NumUnits + NumVehicles + NumWorkers +
## OwnRent + HouseCosts + ElectricBill + FoodStamp + Insurance +
## Language + InsuranceHigh + NumWorkers2 + HouseCostsHigh,
## family = binomial(link = "logit"), data = acs_fit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5995 -0.3188 -0.2038 -0.1272 3.6788
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.020e+01 1.550e+02 -0.130 0.896356
## FamilyTypeMale Head 5.779e-01 3.213e-01 1.798 0.072102 .
## FamilyTypeMarried 1.436e+00 2.069e-01 6.937 4.00e-12 ***
## NumBedrooms 1.191e-01 3.507e-02 3.396 0.000685 ***
## NumChildren 1.853e-01 5.317e-02 3.486 0.000491 ***
## NumPeople -2.690e-01 5.039e-02 -5.338 9.39e-08 ***
## NumRooms 1.064e-01 1.426e-02 7.461 8.61e-14 ***
## NumUnitsSingle attached 1.320e+01 1.550e+02 0.085 0.932166
## NumUnitsSingle detached 1.284e+01 1.550e+02 0.083 0.933970
## NumVehicles 2.531e-01 4.244e-02 5.963 2.48e-09 ***
## NumWorkers 1.778e-01 5.742e-02 3.096 0.001964 **
## OwnRentOutright 1.930e+00 3.441e-01 5.609 2.04e-08 ***
## OwnRentRented 3.982e-01 2.026e-01 1.965 0.049373 *
## HouseCosts 4.985e-04 2.945e-05 16.926 < 2e-16 ***
## ElectricBill 2.001e-03 2.980e-04 6.714 1.89e-11 ***
## FoodStampYes -1.255e+00 3.727e-01 -3.367 0.000760 ***
## Insurance 2.365e-04 3.125e-05 7.568 3.79e-14 ***
## LanguageEnglish -2.851e-01 1.629e-01 -1.750 0.080110 .
## LanguageOther -1.487e-01 2.899e-01 -0.513 0.607872
## LanguageOther European -1.955e-01 1.839e-01 -1.063 0.287918
## LanguageSpanish -6.078e-01 2.096e-01 -2.899 0.003739 **
## InsuranceHigh 4.681e-01 9.381e-02 4.990 6.05e-07 ***
## NumWorkers2 3.137e-02 7.781e-02 0.403 0.686814
## HouseCostsHigh 1.797e-01 1.155e-01 1.555 0.119832
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8398.1 on 18195 degrees of freedom
## Residual deviance: 6140.7 on 18172 degrees of freedom
## AIC: 6188.7
##
## Number of Fisher Scoring iterations: 16
acs_test$yhat_logmod5 <- predict(logmod5, newdata=acs_test, type='response')
acs_test$yhat_logmod5 <- (acs_test$yhat_logmod5 > 0.05) * 1
log_mod5_aic <- summary(logmod5)$aic
log_cm5 <- confusionMatrix(as.factor(acs_test$yhat_logmod5), as.factor(acs_test$HighIncome), positive = "1")
log_cm5
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3145 50
## 1 1132 222
##
## Accuracy : 0.7402
## 95% CI : (0.7272, 0.7529)
## No Information Rate : 0.9402
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1927
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.81618
## Specificity : 0.73533
## Pos Pred Value : 0.16396
## Neg Pred Value : 0.98435
## Prevalence : 0.05979
## Detection Rate : 0.04880
## Detection Prevalence : 0.29765
## Balanced Accuracy : 0.77575
##
## 'Positive' Class : 1
##
logmod6 <- glm(HighIncome ~ FamilyType + NumBedrooms + NumChildren + OwnRent +
HouseCosts + ElectricBill + FoodStamp + InsuranceHigh,
data=acs_fit, family=binomial(link="logit"))
summary(logmod6)
##
## Call:
## glm(formula = HighIncome ~ FamilyType + NumBedrooms + NumChildren +
## OwnRent + HouseCosts + ElectricBill + FoodStamp + InsuranceHigh,
## family = binomial(link = "logit"), data = acs_fit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1642 -0.3266 -0.2067 -0.1488 3.8673
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.125e+00 2.339e-01 -30.467 < 2e-16 ***
## FamilyTypeMale Head 5.904e-01 3.158e-01 1.869 0.0616 .
## FamilyTypeMarried 1.624e+00 2.028e-01 8.006 1.19e-15 ***
## NumBedrooms 2.568e-01 2.700e-02 9.514 < 2e-16 ***
## NumChildren -7.037e-02 2.962e-02 -2.376 0.0175 *
## OwnRentOutright 1.963e+00 3.122e-01 6.287 3.24e-10 ***
## OwnRentRented 5.207e-02 1.983e-01 0.263 0.7929
## HouseCosts 5.746e-04 2.467e-05 23.294 < 2e-16 ***
## ElectricBill 2.358e-03 2.826e-04 8.341 < 2e-16 ***
## FoodStampYes -1.764e+00 3.640e-01 -4.846 1.26e-06 ***
## InsuranceHigh 7.961e-01 8.077e-02 9.856 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8398.1 on 18195 degrees of freedom
## Residual deviance: 6352.8 on 18185 degrees of freedom
## AIC: 6374.8
##
## Number of Fisher Scoring iterations: 8
acs_test$yhat_logmod6 <- predict(logmod6, newdata=acs_test, type='response')
acs_test$yhat_logmod6 <- (acs_test$yhat_logmod6 > 0.05) * 1
log_mod6_aic <- summary(logmod6)$aic
log_cm6 <- confusionMatrix(as.factor(acs_test$yhat_logmod6), as.factor(acs_test$HighIncome), positive = "1")
log_cm6
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3090 55
## 1 1187 217
##
## Accuracy : 0.727
## 95% CI : (0.7138, 0.7399)
## No Information Rate : 0.9402
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1764
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.79779
## Specificity : 0.72247
## Pos Pred Value : 0.15456
## Neg Pred Value : 0.98251
## Prevalence : 0.05979
## Detection Rate : 0.04770
## Detection Prevalence : 0.30864
## Balanced Accuracy : 0.76013
##
## 'Positive' Class : 1
##
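The assignment also calls for k-nearest neighbor. A hedged sketch using class::knn on the engineered 0/1 variables (kNN needs numeric predictors on comparable scales, which the binary flags already are); the variable list and k value are assumptions to experiment with, not a final answer:

```r
# kNN on the engineered binary features; ties are broken at random,
# so set a seed for reproducibility
set.seed(447)
knn_vars <- c("InsuranceHigh", "NumWorkers2", "HouseCostsHigh", "foodstamp_binary")
knn_pred <- knn(train = acs_fit[, knn_vars],
                test  = acs_test[, knn_vars],
                cl    = as.factor(acs_fit$HighIncome),
                k     = 15)
confusionMatrix(knn_pred, as.factor(acs_test$HighIncome), positive = "1")
```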
linear_mod1 <- lm(FamilyIncome ~ FamilyType + FoodStamp + OwnRent + HouseCosts + Insurance + ElectricBill +
NumRooms, data=acs_fit)
summary(linear_mod1)
##
## Call:
## lm(formula = FamilyIncome ~ FamilyType + FoodStamp + OwnRent +
## HouseCosts + Insurance + ElectricBill + NumRooms, data = acs_fit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -372890 -39833 -9227 22628 1198390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.550e+04 2.699e+03 -9.448 < 2e-16 ***
## FamilyTypeMale Head 8.939e+03 3.233e+03 2.765 0.0057 **
## FamilyTypeMarried 3.814e+04 1.866e+03 20.435 < 2e-16 ***
## FoodStampYes -2.614e+04 2.485e+03 -10.519 < 2e-16 ***
## OwnRentOutright 4.074e+04 7.468e+03 5.455 4.95e-08 ***
## OwnRentRented -1.391e+03 2.241e+03 -0.620 0.5350
## HouseCosts 2.738e+01 6.451e-01 42.439 < 2e-16 ***
## Insurance 1.839e+01 7.637e-01 24.078 < 2e-16 ***
## ElectricBill 5.154e+01 6.313e+00 8.163 3.48e-16 ***
## NumRooms 5.537e+03 2.793e+02 19.823 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83680 on 18186 degrees of freedom
## Multiple R-squared: 0.3138, Adjusted R-squared: 0.3135
## F-statistic: 924.1 on 9 and 18186 DF, p-value: < 2.2e-16
acs_test$lin_mod1_FamilyIncome <- predict(linear_mod1, newdata=acs_test)
acs_test$lin_mod1_HighIncome <- ifelse(acs_test$lin_mod1_FamilyIncome > 250000,1,0)
linear_mod1_rsq <- summary(linear_mod1)$r.sq
linear_mod1_aic <- AIC(linear_mod1)
linear_cm1 <- confusionMatrix(as.factor(acs_test$lin_mod1_HighIncome), as.factor(acs_test$HighIncome), positive = "1")
linear_cm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4230 206
## 1 47 66
##
## Accuracy : 0.9444
## 95% CI : (0.9373, 0.9509)
## No Information Rate : 0.9402
## P-Value [Acc > NIR] : 0.123
##
## Kappa : 0.319
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.24265
## Specificity : 0.98901
## Pos Pred Value : 0.58407
## Neg Pred Value : 0.95356
## Prevalence : 0.05979
## Detection Rate : 0.01451
## Detection Prevalence : 0.02484
## Balanced Accuracy : 0.61583
##
## 'Positive' Class : 1
##
# Residual Analysis
summary(acs_test$HighIncome - predict(linear_mod1,newdata=acs_test))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -399700 -138300 -101600 -110700 -76090 28880
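A caveat on the summary above: it subtracts a dollar-scale income prediction from the 0/1 HighIncome indicator, so the "residuals" are dominated by the predictions themselves (hence the mean near -110,700). A sketch of residuals on the income scale instead, assuming acs_test retains the original FamilyIncome column:

```r
# observed family income minus the model's dollar-scale prediction
income_resid <- acs_test$FamilyIncome - predict(linear_mod1, newdata = acs_test)
summary(income_resid)
```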
linear_mod2 <- lm(FamilyIncome ~ Insurance + HouseCosts + ElectricBill + NumWorkers + FamilyType +
FoodStamp + OwnRent + NumBedrooms + NumChildren + NumRooms + NumPeople +
NumVehicles + Language, data=acs_fit)
summary(linear_mod2)
##
## Call:
## lm(formula = FamilyIncome ~ Insurance + HouseCosts + ElectricBill +
## NumWorkers + FamilyType + FoodStamp + OwnRent + NumBedrooms +
## NumChildren + NumRooms + NumPeople + NumVehicles + Language,
## data = acs_fit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -386370 -37621 -9512 20578 1182262
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.907e+04 4.482e+03 -10.947 < 2e-16 ***
## Insurance 1.891e+01 7.552e-01 25.047 < 2e-16 ***
## HouseCosts 2.791e+01 6.532e-01 42.723 < 2e-16 ***
## ElectricBill 4.849e+01 6.330e+00 7.659 1.96e-14 ***
## NumWorkers 1.865e+04 8.823e+02 21.136 < 2e-16 ***
## FamilyTypeMale Head 7.558e+03 3.179e+03 2.378 0.017433 *
## FamilyTypeMarried 3.036e+04 1.875e+03 16.191 < 2e-16 ***
## FoodStampYes -1.214e+04 2.548e+03 -4.766 1.89e-06 ***
## OwnRentOutright 5.753e+04 7.362e+03 7.814 5.83e-15 ***
## OwnRentRented 7.785e+03 2.241e+03 3.474 0.000513 ***
## NumBedrooms 2.337e+03 7.579e+02 3.083 0.002050 **
## NumChildren 3.963e+03 7.790e+02 5.087 3.67e-07 ***
## NumRooms 4.482e+03 3.381e+02 13.254 < 2e-16 ***
## NumPeople -7.638e+03 7.218e+02 -10.582 < 2e-16 ***
## NumVehicles 7.058e+03 7.393e+02 9.547 < 2e-16 ***
## LanguageEnglish 3.073e+03 3.400e+03 0.904 0.366076
## LanguageOther -1.447e+03 6.383e+03 -0.227 0.820722
## LanguageOther European -1.789e+03 3.850e+03 -0.465 0.642238
## LanguageSpanish -9.829e+03 3.931e+03 -2.500 0.012415 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 82010 on 18177 degrees of freedom
## Multiple R-squared: 0.3413, Adjusted R-squared: 0.3407
## F-statistic: 523.3 on 18 and 18177 DF, p-value: < 2.2e-16
acs_test$lin_mod2_FamilyIncome <- predict(linear_mod2, newdata=acs_test)
acs_test$lin_mod2_HighIncome <- ifelse(acs_test$lin_mod2_FamilyIncome > 250000,1,0)
linear_mod2_rsq <- summary(linear_mod2)$r.sq
linear_mod2_aic <- AIC(linear_mod2)
linear_cm2 <- confusionMatrix(as.factor(acs_test$lin_mod2_HighIncome), as.factor(acs_test$HighIncome), positive = "1")
linear_cm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4227 207
## 1 50 65
##
## Accuracy : 0.9435
## 95% CI : (0.9364, 0.95)
## No Information Rate : 0.9402
## P-Value [Acc > NIR] : 0.1827
##
## Kappa : 0.3114
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.23897
## Specificity : 0.98831
## Pos Pred Value : 0.56522
## Neg Pred Value : 0.95332
## Prevalence : 0.05979
## Detection Rate : 0.01429
## Detection Prevalence : 0.02528
## Balanced Accuracy : 0.61364
##
## 'Positive' Class : 1
##
# Residual Analysis
summary(acs_test$HighIncome - predict(linear_mod2,newdata=acs_test))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -437000 -139600 -103600 -110700 -74330 44130
Below is a summary of all regression models.
sprintf("LOGISTIC REGRESSION")
## [1] "LOGISTIC REGRESSION"
sprintf("Logistic model 1: Predicted Accuracy = %.4f Predicted Sensitivity = %.3f AIC = %.1f",
log_cm1$overall['Accuracy'], log_cm1$byClass['Sensitivity'], log_mod1_aic)
## [1] "Logistic model 1: Predicted Accuracy = 0.6324 Predicted Sensitivity = 0.790 AIC = 7118.0"
sprintf("Logistic model 2: Predicted Accuracy = %.4f Predicted Sensitivity = %.3f AIC = %.1f",
log_cm2$overall['Accuracy'], log_cm2$byClass['Sensitivity'], log_mod2_aic)
## [1] "Logistic model 2: Predicted Accuracy = 0.3317 Predicted Sensitivity = 0.897 AIC = 8049.1"
sprintf("Logistic model 3: Predicted Accuracy = %.4f Predicted Sensitivity = %.3f AIC = %.1f",
log_cm3$overall['Accuracy'], log_cm3$byClass['Sensitivity'], log_mod3_aic)
## [1] "Logistic model 3: Predicted Accuracy = 0.7232 Predicted Sensitivity = 0.735 AIC = 7319.5"
sprintf("Logistic model 4: Predicted Accuracy = %.4f Predicted Sensitivity = %.3f AIC = %.1f",
log_cm4$overall['Accuracy'], log_cm4$byClass['Sensitivity'], log_mod4_aic)
## [1] "Logistic model 4: Predicted Accuracy = 0.7191 Predicted Sensitivity = 0.732 AIC = 7412.8"
sprintf("Logistic model 5: Predicted Accuracy = %.4f Predicted Sensitivity = %.3f AIC = %.1f",
log_cm5$overall['Accuracy'], log_cm5$byClass['Sensitivity'], log_mod5_aic)
## [1] "Logistic model 5: Predicted Accuracy = 0.7402 Predicted Sensitivity = 0.816 AIC = 6188.7"
sprintf("Logistic model 6: Predicted Accuracy = %.4f Predicted Sensitivity = %.3f AIC = %.1f",
log_cm6$overall['Accuracy'], log_cm6$byClass['Sensitivity'], log_mod6_aic)
## [1] "Logistic model 6: Predicted Accuracy = 0.7270 Predicted Sensitivity = 0.798 AIC = 6374.8"
sprintf(" ")
## [1] " "
sprintf("LINEAR REGRESSION")
## [1] "LINEAR REGRESSION"
sprintf("Linear model 1: Predicted Accuracy = %.4f Predicted Sensitivity = %.3f Adj R-squared = %.3f",
linear_cm1$overall['Accuracy'], linear_cm1$byClass['Sensitivity'], linear_mod1_rsq)
## [1] "Linear model 1: Predicted Accuracy = 0.9444 Predicted Sensitivity = 0.243 Adj R-squared = 0.314"
sprintf("Linear model 2: Predicted Accuracy = %.4f Predicted Sensitivity = %.3f Adj R-squared = %.3f",
linear_cm2$overall['Accuracy'], linear_cm2$byClass['Sensitivity'], linear_mod2_rsq)
## [1] "Linear model 2: Predicted Accuracy = 0.9435 Predicted Sensitivity = 0.239 Adj R-squared = 0.341"
tree1 <- rpart(HighIncome ~ FamilyType + HouseCosts + NumWorkers2 + OwnRent + Insurance +
YearBuilt + NumBedrooms, data=acs_fit, method="class")
rpart.plot(tree1)
##head(predict(tree1))
##head(predict(tree1, type="class"))
tree1_cm <- confusionMatrix(predict(tree1, type="class"), acs_fit$HighIncome, positive = "1")
tree1_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 16903 890
## 1 176 227
##
## Accuracy : 0.9414
## 95% CI : (0.9379, 0.9448)
## No Information Rate : 0.9386
## P-Value [Acc > NIR] : 0.05864
##
## Kappa : 0.2751
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.20322
## Specificity : 0.98969
## Pos Pred Value : 0.56328
## Neg Pred Value : 0.94998
## Prevalence : 0.06139
## Detection Rate : 0.01248
## Detection Prevalence : 0.02215
## Balanced Accuracy : 0.59646
##
## 'Positive' Class : 1
##
# Residual analysis
summary(acs_test$HighIncome - predict(tree1,newdata=acs_test))
## 0 1
## Min. :-0.9580 Min. :-0.563275
## 1st Qu.:-0.9580 1st Qu.:-0.041991
## Median :-0.9580 Median :-0.041991
## Mean :-0.8776 Mean :-0.002771
## 3rd Qu.:-0.9580 3rd Qu.:-0.041991
## Max. : 0.5633 Max. : 0.958009
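The two-column layout above occurs because predict() on a classification rpart returns a matrix of class probabilities, one column per class, so the subtraction is applied to both columns. To difference against the probability of the positive class only, one could take just that column (a sketch, assuming the class levels are "0" and "1"):

```r
# probability assigned to class "1" by the tree, one value per test row
p1 <- predict(tree1, newdata = acs_test)[, "1"]
summary(acs_test$HighIncome - p1)
```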
tree2 <- rpart(HighIncome ~ FoodStamp + Insurance + FamilyType, data=acs_fit, method="class",
control=rpart.control(minsplit=2, cp=0))
rpart.plot(tree2)
##head(predict(tree2))
##head(predict(tree2, type="class"))
tree2_cm <- confusionMatrix(predict(tree2, type="class"), acs_fit$HighIncome, positive = "1")
tree2_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 17062 1086
## 1 17 31
##
## Accuracy : 0.9394
## 95% CI : (0.9358, 0.9428)
## No Information Rate : 0.9386
## P-Value [Acc > NIR] : 0.3397
##
## Kappa : 0.0484
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.027753
## Specificity : 0.999005
## Pos Pred Value : 0.645833
## Neg Pred Value : 0.940159
## Prevalence : 0.061387
## Detection Rate : 0.001704
## Detection Prevalence : 0.002638
## Balanced Accuracy : 0.513379
##
## 'Positive' Class : 1
##
# Residual analysis
summary(acs_test$HighIncome - predict(tree2,newdata=acs_test))
## 0 1
## Min. :-1.0000 Min. :-1.000000
## 1st Qu.:-0.9838 1st Qu.:-0.074370
## Median :-0.9572 Median :-0.022409
## Mean :-0.8779 Mean :-0.002539
## 3rd Qu.:-0.9256 3rd Qu.:-0.016161
## Max. : 0.5714 Max. : 1.000000
tree3 <- rpart(HighIncome ~ Insurance + ElectricBill + HouseCosts, data=acs_fit, method="class",
control=rpart.control(minsplit=2, cp=.005))
rpart.plot(tree3)
##head(predict(tree3))
##head(predict(tree3, type="class"))
tree3_cm <- confusionMatrix(predict(tree3, type="class"), acs_fit$HighIncome, positive = "1")
tree3_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 16991 957
## 1 88 160
##
## Accuracy : 0.9426
## 95% CI : (0.9391, 0.9459)
## No Information Rate : 0.9386
## P-Value [Acc > NIR] : 0.013
##
## Kappa : 0.217
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.143241
## Specificity : 0.994847
## Pos Pred Value : 0.645161
## Neg Pred Value : 0.946679
## Prevalence : 0.061387
## Detection Rate : 0.008793
## Detection Prevalence : 0.013629
## Balanced Accuracy : 0.569044
##
## 'Positive' Class : 1
##
# Residual analysis
summary(acs_test$HighIncome - predict(tree3,newdata=acs_test))
## 0 1
## Min. :-0.9580 Min. :-0.740260
## 1st Qu.:-0.9580 1st Qu.:-0.041991
## Median :-0.9580 Median :-0.041991
## Mean :-0.8779 Mean :-0.002525
## 3rd Qu.:-0.9580 3rd Qu.:-0.041991
## Max. : 0.7403 Max. : 0.958009
tree4 <- rpart(HighIncome ~ Insurance + ElectricBill + HouseCosts + NumBedrooms + NumChildren +
NumPeople + NumRooms + NumVehicles + NumWorkers + FoodStamp + OwnRent,
data=acs_fit, method="class", control=rpart.control(minsplit=2, cp=0))
rpart.plot(tree4)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
##head(predict(tree4))
##head(predict(tree4, type="class"))
tree4_cm <- confusionMatrix(predict(tree4, type="class"), acs_fit$HighIncome, positive = "1")
tree4_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 17079 1
## 1 0 1116
##
## Accuracy : 0.9999
## 95% CI : (0.9997, 1)
## No Information Rate : 0.9386
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9995
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.99910
## Specificity : 1.00000
## Pos Pred Value : 1.00000
## Neg Pred Value : 0.99994
## Prevalence : 0.06139
## Detection Rate : 0.06133
## Detection Prevalence : 0.06133
## Balanced Accuracy : 0.99955
##
## 'Positive' Class : 1
##
# Residual analysis
summary(acs_test$HighIncome - predict(tree4,newdata=acs_test))
## 0 1
## Min. :-1.0000 Min. :-1.00000
## 1st Qu.:-1.0000 1st Qu.: 0.00000
## Median :-1.0000 Median : 0.00000
## Mean :-0.8648 Mean :-0.01561
## 3rd Qu.:-1.0000 3rd Qu.: 0.00000
## Max. : 1.0000 Max. : 1.00000
tree5 <- rpart(HighIncome ~ Insurance + ElectricBill + HouseCosts + NumWorkers2, data=acs_fit,
method="class", control=rpart.control(minsplit=2, cp=.0025))
rpart.plot(tree5)
##head(predict(tree5))
##head(predict(tree5, type="class"))
tree5_cm <- confusionMatrix(predict(tree5, type="class"), acs_fit$HighIncome, positive = "1")
tree5_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 16972 920
## 1 107 197
##
## Accuracy : 0.9436
## 95% CI : (0.9401, 0.9469)
## No Information Rate : 0.9386
## P-Value [Acc > NIR] : 0.002593
##
## Kappa : 0.2578
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.17637
## Specificity : 0.99373
## Pos Pred Value : 0.64803
## Neg Pred Value : 0.94858
## Prevalence : 0.06139
## Detection Rate : 0.01083
## Detection Prevalence : 0.01671
## Balanced Accuracy : 0.58505
##
## 'Positive' Class : 1
##
# Residual analysis
summary(acs_test$HighIncome - predict(tree5,newdata=acs_test))
## 0 1
## Min. :-1.0000 Min. :-0.789474
## 1st Qu.:-0.9580 1st Qu.:-0.041991
## Median :-0.9580 Median :-0.041991
## Mean :-0.8780 Mean :-0.002413
## 3rd Qu.:-0.9580 3rd Qu.:-0.041991
## Max. : 0.7895 Max. : 0.958009
# make predictions using test data
tree1_pred <- predict(tree1, acs_test, type="class" )
tree2_pred <- predict(tree2, acs_test, type="class" )
tree3_pred <- predict(tree3, acs_test, type="class" )
tree4_pred <- predict(tree4, acs_test, type="class" )
tree5_pred <- predict(tree5, acs_test, type="class" )
# Confusion matrices
tree_cm1_pred <- confusionMatrix(tree1_pred, acs_test$HighIncome, positive = "1")
tree_cm2_pred <- confusionMatrix(tree2_pred, acs_test$HighIncome, positive = "1")
tree_cm3_pred <- confusionMatrix(tree3_pred, acs_test$HighIncome, positive = "1")
tree_cm4_pred <- confusionMatrix(tree4_pred, acs_test$HighIncome, positive = "1")
tree_cm5_pred <- confusionMatrix(tree5_pred, acs_test$HighIncome, positive = "1")
# Display comparison of accuracy of each decision tree
sprintf("The no information rate = %.4f", tree1_cm$overall['AccuracyNull'])
## [1] "The no information rate = 0.9386"
sprintf("Tree1: Fit Accuracy = %.4f Predicted Accuracy = %.4f Predicted Sensitivity = %.4f",tree1_cm$overall['Accuracy'],
tree_cm1_pred$overall['Accuracy'], tree_cm1_pred$byClass['Sensitivity'])
## [1] "Tree1: Fit Accuracy = 0.9414 Predicted Accuracy = 0.9422 Predicted Sensitivity = 0.2169"
sprintf("Tree2: Fit Accuracy = %.4f Predicted Accuracy = %.4f Predicted Sensitivity = %.4f",tree2_cm$overall['Accuracy'],
tree_cm2_pred$overall['Accuracy'], tree_cm2_pred$byClass['Sensitivity'])
## [1] "Tree2: Fit Accuracy = 0.9394 Predicted Accuracy = 0.9391 Predicted Sensitivity = 0.0037"
sprintf("Tree3: Fit Accuracy = %.4f Predicted Accuracy = %.4f Predicted Sensitivity = %.4f",tree3_cm$overall['Accuracy'],
tree_cm3_pred$overall['Accuracy'], tree_cm3_pred$byClass['Sensitivity'])
## [1] "Tree3: Fit Accuracy = 0.9426 Predicted Accuracy = 0.9426 Predicted Sensitivity = 0.1471"
sprintf("Tree4: Fit Accuracy = %.4f Predicted Accuracy = %.4f Predicted Sensitivity = %.4f",tree4_cm$overall['Accuracy'],
tree_cm4_pred$overall['Accuracy'], tree_cm4_pred$byClass['Sensitivity'])
## [1] "Tree4: Fit Accuracy = 0.9999 Predicted Accuracy = 0.8965 Predicted Sensitivity = 0.2647"
sprintf("Tree5: Fit Accuracy = %.4f Predicted Accuracy = %.4f Predicted Sensitivity = %.4f",tree5_cm$overall['Accuracy'],
tree_cm5_pred$overall['Accuracy'], tree_cm5_pred$byClass['Sensitivity'])
## [1] "Tree5: Fit Accuracy = 0.9436 Predicted Accuracy = 0.9424 Predicted Sensitivity = 0.1654"
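The repeated sprintf calls above can be collapsed into a single comparison table; a sketch assuming the fitted trees and confusion matrices created in this section:

```r
# gather fit accuracy, predicted accuracy, and predicted sensitivity for all five trees
fit_cms  <- list(tree1_cm, tree2_cm, tree3_cm, tree4_cm, tree5_cm)
pred_cms <- list(tree_cm1_pred, tree_cm2_pred, tree_cm3_pred, tree_cm4_pred, tree_cm5_pred)
data.frame(
  model            = paste0("Tree", 1:5),
  fit_accuracy     = sapply(fit_cms,  function(cm) cm$overall["Accuracy"]),
  pred_accuracy    = sapply(pred_cms, function(cm) cm$overall["Accuracy"]),
  pred_sensitivity = sapply(pred_cms, function(cm) cm$byClass["Sensitivity"])
)
```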
Tree model 5 is a close contender to model 3: they have the two highest predicted accuracies but very different sensitivities.
k-nearest neighbors can only handle numerical data.
It is recommended, in nearly all cases, to normalize the data set first.
# function to normalize data
normalize <- function(x) {
num <- x - min(x)
denom <- max(x) - min(x)
return (num/denom)
}
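For example, applied to a small vector, the function above maps the minimum to 0 and the maximum to 1:

```r
normalize(c(10, 20, 30, 50))  # 0.00 0.25 0.50 1.00
```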
# create and normalize a new data frame for knn analysis
acs_numericals <- data.frame(acs$NumBedrooms, acs$NumChildren, acs$NumPeople,
                             acs$NumRooms, acs$NumVehicles, acs$NumWorkers,
                             acs$HouseCosts, acs$ElectricBill, acs$Insurance)
# note: only the first eight columns are normalized and carried forward; Insurance is excluded
acs_norm <- as.data.frame(lapply(acs_numericals[1:8], normalize))
acs_norm$HighIncome1 <- acs$HighIncome
m <- nrow(acs_numericals)
val <- sample(1:m, size = round(m/3))
acsNorm_learn <- acs_norm[-val,]
acsNorm_valid <- acs_norm[val,]
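Note that sample() makes the learn/validation split random, so the kNN results below will vary from run to run; seeding the generator first makes the split reproducible (a minimal sketch, using any fixed seed):

```r
set.seed(1234)  # fixed seed: the same rows land in the validation set every run
val <- sample(1:m, size = round(m/3))
```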
# view new data frame to verify normalization
summary(acs_norm)
## acs.NumBedrooms acs.NumChildren acs.NumPeople acs.NumRooms
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.3750 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.2500
## Median :0.3750 Median :0.0000 Median :0.0625 Median :0.3000
## Mean :0.4232 Mean :0.0751 Mean :0.0869 Mean :0.3087
## 3rd Qu.:0.5000 3rd Qu.:0.1667 3rd Qu.:0.1250 3rd Qu.:0.3500
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## acs.NumVehicles acs.NumWorkers acs.HouseCosts acs.ElectricBill
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.3333 1st Qu.:0.3333 1st Qu.:0.09117 1st Qu.:0.1710
## Median :0.3333 Median :0.6667 Median :0.16878 Median :0.2573
## Mean :0.3521 Mean :0.5815 Mean :0.20836 Mean :0.3006
## 3rd Qu.:0.5000 3rd Qu.:0.6667 3rd Qu.:0.28168 3rd Qu.:0.3782
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## HighIncome1
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.06107
## 3rd Qu.:0.00000
## Max. :1.00000
acs_knn1 <- knn(acsNorm_learn[,1:8], acsNorm_valid[,1:8], acsNorm_learn$HighIncome1, k=5, prob = TRUE)
##head(acs_knn1)
pcol1 <- as.character(as.numeric(acsNorm_valid$HighIncome1))
pairs(acsNorm_valid[1:8], pch = pcol1, col = c("green3", "red")
[(acsNorm_valid$HighIncome1 != acs_knn1)+1])
knn1_cm_pred <- confusionMatrix(acs_knn1, as.factor(acsNorm_valid$HighIncome1), positive = "1")
knn1_cm_pred
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7025 407
## 1 77 73
##
## Accuracy : 0.9362
## 95% CI : (0.9304, 0.9416)
## No Information Rate : 0.9367
## P-Value [Acc > NIR] : 0.5866
##
## Kappa : 0.2079
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.152083
## Specificity : 0.989158
## Pos Pred Value : 0.486667
## Neg Pred Value : 0.945237
## Prevalence : 0.063308
## Detection Rate : 0.009628
## Detection Prevalence : 0.019784
## Balanced Accuracy : 0.570621
##
## 'Positive' Class : 1
##
acs_knn2 <- knn(acsNorm_learn[,1:4], acsNorm_valid[,1:4], acsNorm_learn$HighIncome1, k=5, prob = TRUE)
##head(acs_knn2)
pcol2 <- as.character(as.numeric(acsNorm_valid$HighIncome1))
pairs(acsNorm_valid[2:5], pch = pcol2, col = c("green3", "red")
[(acsNorm_valid$HighIncome1 != acs_knn2)+1])
knn2_cm_pred <- confusionMatrix(acs_knn2, as.factor(acsNorm_valid$HighIncome1), positive = "1")
knn2_cm_pred
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7071 475
## 1 31 5
##
## Accuracy : 0.9333
## 95% CI : (0.9274, 0.9388)
## No Information Rate : 0.9367
## P-Value [Acc > NIR] : 0.8936
##
## Kappa : 0.0106
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0104167
## Specificity : 0.9956350
## Pos Pred Value : 0.1388889
## Neg Pred Value : 0.9370527
## Prevalence : 0.0633078
## Detection Rate : 0.0006595
## Detection Prevalence : 0.0047481
## Balanced Accuracy : 0.5030258
##
## 'Positive' Class : 1
##
acs_knn3 <- knn(acsNorm_learn[,6:8], acsNorm_valid[,6:8], acsNorm_learn$HighIncome1, k=3, prob = TRUE)
##head(acs_knn3)
pcol3 <- as.character(as.numeric(acsNorm_valid$HighIncome1))
pairs(acsNorm_valid[6:8], pch = pcol3, col = c("green3", "red")
[(acsNorm_valid$HighIncome1 != acs_knn3)+1])
knn3_cm_pred <- confusionMatrix(acs_knn3, as.factor(acsNorm_valid$HighIncome1), positive = "1")
knn3_cm_pred
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7028 408
## 1 74 72
##
## Accuracy : 0.9364
## 95% CI : (0.9307, 0.9418)
## No Information Rate : 0.9367
## P-Value [Acc > NIR] : 0.5496
##
## Kappa : 0.2066
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.150000
## Specificity : 0.989580
## Pos Pred Value : 0.493151
## Neg Pred Value : 0.945132
## Prevalence : 0.063308
## Detection Rate : 0.009496
## Detection Prevalence : 0.019256
## Balanced Accuracy : 0.569790
##
## 'Positive' Class : 1
##
acs_knn4 <- knn(acsNorm_learn[,1:8], acsNorm_valid[,1:8], acsNorm_learn$HighIncome1, k=10, prob = TRUE)
##head(acs_knn4)
pcol4 <- as.character(as.numeric(acsNorm_valid$HighIncome1))
pairs(acsNorm_valid[1:8], pch = pcol4, col = c("green3", "red")
[(acsNorm_valid$HighIncome1 != acs_knn4)+1])
knn4_cm_pred <- confusionMatrix(acs_knn4, as.factor(acsNorm_valid$HighIncome1), positive = "1")
knn4_cm_pred
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7061 425
## 1 41 55
##
## Accuracy : 0.9385
## 95% CI : (0.9329, 0.9438)
## No Information Rate : 0.9367
## P-Value [Acc > NIR] : 0.2635
##
## Kappa : 0.1735
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.114583
## Specificity : 0.994227
## Pos Pred Value : 0.572917
## Neg Pred Value : 0.943227
## Prevalence : 0.063308
## Detection Rate : 0.007254
## Detection Prevalence : 0.012662
## Balanced Accuracy : 0.554405
##
## 'Positive' Class : 1
##
acs_knn5 <- knn(acsNorm_learn[,1:8], acsNorm_valid[,1:8], acsNorm_learn$HighIncome1, k=25, prob = TRUE)
#head(acs_knn5)
pcol5 <- as.character(as.numeric(acsNorm_valid$HighIncome1))
pairs(acsNorm_valid[1:8], pch = pcol5, col = c("green3", "red")
[(acsNorm_valid$HighIncome1 != acs_knn5)+1])
knn5_cm_pred <- confusionMatrix(acs_knn5, as.factor(acsNorm_valid$HighIncome1), positive = "1")
knn5_cm_pred
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7075 425
## 1 27 55
##
## Accuracy : 0.9404
## 95% CI : (0.9348, 0.9456)
## No Information Rate : 0.9367
## P-Value [Acc > NIR] : 0.09649
##
## Kappa : 0.1806
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.114583
## Specificity : 0.996198
## Pos Pred Value : 0.670732
## Neg Pred Value : 0.943333
## Prevalence : 0.063308
## Detection Rate : 0.007254
## Detection Prevalence : 0.010815
## Balanced Accuracy : 0.555391
##
## 'Positive' Class : 1
##
acs_knn6 <- knn(acsNorm_learn[,1:8], acsNorm_valid[,1:8], acsNorm_learn$HighIncome1, k=50, prob = TRUE)
##head(acs_knn6)
pcol6 <- as.character(as.numeric(acsNorm_valid$HighIncome1))
pairs(acsNorm_valid[1:8], pch = pcol6, col = c("green3", "red")
[(acsNorm_valid$HighIncome1 != acs_knn6)+1])
knn6_cm_pred <- confusionMatrix(acs_knn6, as.factor(acsNorm_valid$HighIncome1), positive = "1")
knn6_cm_pred
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7084 434
## 1 18 46
##
## Accuracy : 0.9404
## 95% CI : (0.9348, 0.9456)
## No Information Rate : 0.9367
## P-Value [Acc > NIR] : 0.09649
##
## Kappa : 0.1566
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.095833
## Specificity : 0.997466
## Pos Pred Value : 0.718750
## Neg Pred Value : 0.942272
## Prevalence : 0.063308
## Detection Rate : 0.006067
## Detection Prevalence : 0.008441
## Balanced Accuracy : 0.546649
##
## 'Positive' Class : 1
##
Summary output of each kNN model for comparison; displayed values are for the held-out validation data.
How well did each model do compared to the others?
sprintf("The no information rate = %.4f", knn1_cm_pred$overall['AccuracyNull'])
## [1] "The no information rate = 0.9367"
sprintf("Knn 1: Predicted Accuracy = %.4f Predicted Sensitivity = %.4f", knn1_cm_pred$overall['Accuracy'],
knn1_cm_pred$byClass['Sensitivity'])
## [1] "Knn 1: Predicted Accuracy = 0.9362 Predicted Sensitivity = 0.1521"
sprintf("Knn 2: Predicted Accuracy = %.4f Predicted Sensitivity = %.4f", knn2_cm_pred$overall['Accuracy'],
knn2_cm_pred$byClass['Sensitivity'])
## [1] "Knn 2: Predicted Accuracy = 0.9333 Predicted Sensitivity = 0.0104"
sprintf("Knn 3: Predicted Accuracy = %.4f Predicted Sensitivity = %.4f", knn3_cm_pred$overall['Accuracy'],
knn3_cm_pred$byClass['Sensitivity'])
## [1] "Knn 3: Predicted Accuracy = 0.9364 Predicted Sensitivity = 0.1500"
sprintf("Knn 4: Predicted Accuracy = %.4f Predicted Sensitivity = %.4f", knn4_cm_pred$overall['Accuracy'],
knn4_cm_pred$byClass['Sensitivity'])
## [1] "Knn 4: Predicted Accuracy = 0.9385 Predicted Sensitivity = 0.1146"
sprintf("Knn 5: Predicted Accuracy = %.4f Predicted Sensitivity = %.4f", knn5_cm_pred$overall['Accuracy'],
knn5_cm_pred$byClass['Sensitivity'])
## [1] "Knn 5: Predicted Accuracy = 0.9404 Predicted Sensitivity = 0.1146"
sprintf("Knn 6: Predicted Accuracy = %.4f Predicted Sensitivity = %.4f", knn6_cm_pred$overall['Accuracy'],
knn6_cm_pred$byClass['Sensitivity'])
## [1] "Knn 6: Predicted Accuracy = 0.9404 Predicted Sensitivity = 0.0958"
kNN models 5 and 6 tie for the highest tested accuracy (.9404), so I would choose model 5 for its higher sensitivity (.1146 vs. .0958). This choice comes with reservation because of the low sensitivity; however, all of these models have low sensitivity, which forces a decision based mostly on accuracy.
As k increases, accuracy increases up through k = 50. One could loop over candidate k values to find the point where the gain in accuracy levels off.
One caution: as k grows very large the model tends toward always predicting the majority class, which can produce what I call a "garbage" model with low sensitivity and little use outside this scope.
It would also be valuable to vary the variables included in kNN. I only tried two sets of variables and mostly focused on changing k; much more time could be spent tweaking independent-variable combinations.
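The k-scan described above can be written as a loop; a sketch assuming the normalized acsNorm_learn and acsNorm_valid frames from this section:

```r
# scan candidate k values and record validation accuracy for each
ks <- c(5, 10, 25, 50, 75, 100)
acc <- sapply(ks, function(k) {
  pred <- knn(acsNorm_learn[, 1:8], acsNorm_valid[, 1:8],
              acsNorm_learn$HighIncome1, k = k)
  mean(as.character(pred) == as.character(acsNorm_valid$HighIncome1))
})
data.frame(k = ks, accuracy = acc)
```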
Best regression model = linear regression model 1, based on accuracy of .9444 and sensitivity of .243.
Best decision tree = tree model 3, based on accuracy of .9426 and sensitivity of .1471 (low, which is not ideal, but it predicted the best).
Best kNN model = kNN model 5, based on the highest tested accuracy of .9404 (tied with model 6) and the higher sensitivity of .1146. This choice comes with reservation because of the low sensitivity, but since all of the kNN models have low sensitivity the decision rests on accuracy alone.
Overall winner: linear regression model 1! Comparing the best model from each family above reveals that it has BOTH the highest accuracy and the highest sensitivity.
+ Fit accuracy does not guarantee the most accurate predictions - decision tree 4 is a good example of overfitting
+ Until a model is tested on new data, a decision on which model to use is risky
+ The logistic models were very ineffective, which surprised me
+ I had better luck using linear regression to estimate family income, converting the estimates to a binary, and testing that model
+ More complicated models can have weak predictive power
+ Interestingly, my best model for hw4 was a relatively simple one using the "workhorse" method of linear regression
Comments on the logit and linear regression models
My linear model does estimate negative incomes, which is neither logically nor statistically sound. However, the purpose here is to find the best predictive model, so I ignored this to see how accurately I could predict HighIncome. I also created multiple linear regressions and chose this one, basing my decision on R-squared and on changes in adjusted R-squared as I added and removed variables.
Best regression model = linear regression model 1, based on accuracy of .9444 (above the no-information rate) and sensitivity of .243.
Residual analysis of my LM models shows a mean very far from 0, with quite a large range between min and max. This is not ideal, but for the purposes of strictly predicting as well as possible I can ignore it.